Sampling hidden populations is particularly challenging using standard sampling methods mainly 
because of the lack of a sampling frame. Respondent-driven sampling (RDS) is an alternative 
methodology that exploits the social contacts between peers to reach and weight individuals in 
these hard-to-reach populations. It is a snowball sampling procedure where the weight of the 
respondents is adjusted for the likelihood of being sampled due to differences in the number of 
contacts. In RDS, the structure of the social contacts thus defines the sampling process and 
affects its coverage, for instance by constraining the sampling within a sub-region of the network. 
In this paper we study the bias induced by network structures such as social triangles, community 
structure, and heterogeneities in the number of contacts, in the recruitment trees and in the RDS 
estimator. We simulate different scenarios of network structures and response-rates to study the 
potential biases one may expect in real settings. We find that the prevalence of the estimated 
variable is associated with the size of the network community to which the individual belongs. 
Furthermore, we observe that low-degree nodes may be under-sampled in certain situations if the 
sample and the network are of similar size. Finally, we also show that low response-rates lead to 
reasonably accurate average estimates of the prevalence but generate relatively large biases. 


I. Introduction 

In order to estimate the prevalence of diseases, traits or
behaviors in particular social groups or even in the entire
society, researchers typically rely on samples of the target
population. A carefully selected sample may yield satisfactorily
low standard errors while also saving research resources
and time. A common challenge is to obtain
a significant and unbiased sample of the target population.
This is particularly difficult if the population of interest is
segregated, stigmatized, or in some other way difficult
to reach, such that a sampling frame cannot be well defined.
These so-called hidden (or hard-to-reach) populations
may be, for example, men who have sex with men (MSM),
sex workers, injecting drug users, criminals, homeless people, or
minority groups^.

In 1997, Heckathorn introduced a new methodology to
sample hidden populations named respondent-driven sampling
(RDS)^. RDS exploits the underlying social network
structure in order to reach the target population through the
participants' own peers. The method is a variation
of snowball sampling in which the statistical estimators
are weighted to compensate for the non-random nature of the
recruiting process, i.e. that individuals with many potential
recruiters have a higher chance of being sampled. In RDS,
researchers select seeds to start the recruitment. A seed person
then invites a number of other individuals to participate in
the survey by passing a coupon to them. Those successfully
recruited respond to a survey and get new coupons to invite
a number of other individuals within their own social network,
and the process is repeated until enough participants
are recruited. Successful recruitment and participation in
the survey are both financially compensated. A fundamental
assumption is that each participant knows the number of
his or her own acquaintances in the target population, or in
network jargon, his or her own degree. This information
is used as weights to estimate the prevalence of the variable
of interest in the study population.

The perhaps most popular RDS statistical estimator is
due to Volz and Heckathorn, who devised a Markov process
whose equilibrium distribution is the same as the distribution
of the target population^. This estimator is derived
under a series of assumptions regarding both the underlying
network structure and the recruitment process per se. Although
the assumptions are generally reasonable, they are sometimes
relatively strict for realistic settings, as for example
the uniformly random selection of peers, persistent successful
recruitment, and sampling with replacement'^. These and
other assumptions have been scrutinized in previous theoretical
studies and the estimator has performed satisfactorily
in different scenarios using both synthetic^^^ and real
networks®’®. A number of real-life studies have also concluded
that RDS is an effective sampling method for various categories
of hidden populations (See for example Refs.^’^°“^^).

Figure 1: Networks with and without structure. The panel shows a schematic network a) completely random, i.e. without triangles and community
structure, b) with 5 triangles and without community structure, and c) with 4 network communities and 1 triangle (in the bottom-right community).

Social networks are however highly heterogeneous in the
sense that the structure of connections cannot be represented
by characteristic values. This is the case for the number of
contacts per individual and for the level of clustering between
them^'^’^®. Since the RDS dynamics is constrained by the
network structure, one may expect that different patterns of
connectivity affect the recruitment chains. For example, the
network structure may be such that a recruitment tree grows
only in one part of the network^®^^®. In realistic settings
using sampling without replacement, even if all individuals
are willing to participate, trees may simply die out because
a network has been locally exhausted and bridging nodes
block further propagation of coupons to other parts of the
network^®. Such a situation is not unlikely in highly clustered
sub-populations where coupons may simply move around
the same group of people. Previous theoretical studies have
addressed some of these network constraints by studying the
RDS performance on either synthetic structures®’^ or samples
of real networks®’®. Each approach to modeling social
networks has its own advantages and limitations. On the one hand,
simple synthetic structures and sampling processes are
unrealistic but allow some mathematical tractability and thus
intuitive understanding. On the other hand, samples of real
networks may themselves be biased due to their own sampling
and thus potential incompleteness of data®®’®^.

Network clustering is particularly important in the context
of social networks and should be carefully assessed. It
may have different meanings but here we associate clustering
with social triangles, i.e. the fact that common contacts of a
person are also in contact themselves. Network communities
are also a form of clustering in which groups of individuals
are more connected between themselves than with individuals
in other groups. As already mentioned, clustering in
all its forms is not uniform across a network. In practice,
this means that one may find hidden sub-populations within
the study population. Examples include social groups with
particular features (e.g. wealth, foreigners, ethnic minorities)
embedded in the target population^®, transsexuals in
populations of MSM, or geographically sparse populations^^.
While these sub-populations may potentially be removed by
defining a stricter sampling frame, social groups (or communities)
are inherent to social and other human contact
networks^®’®®. Note that network clustering is not the same
as homophily, that is the tendency of similar individuals
to associate, but one may enhance the other. For example,
individuals may share social contacts because they live
geographically close, share workplaces, or are structured in
organizations (potentially leading to network clustering) but
may be completely different in other aspects (low homophily
in wealth, health status, gender, infection status, and so on).

In this paper, we use computational algorithms to generate
synthetic networks with various levels of clustering and
with network communities of various sizes, aiming to reproduce
structures observed in real social networks. Using
realistic parameters, we simulate an RDS process on these
networks and quantify the performance of the RDS estimator
in different scenarios of the prevalence of an arbitrary
variable of interest. The paper is organized as follows. We
first analyze how triangles and community structure affect
how the RDS spreads in the network, in terms of the size of
the transmission trees and the generations of recruitment. Then we
investigate how clustering affects the validity and reliability
of the RDSII estimator as a function of different willingness
to participate (response-rates) in the population. We also
test the effect of clustering for scenarios where the variable
under study is correlated with the degree of the nodes and
the size of the network community. Thereafter, we study
the consequences of the biased selection of seeds, the bias
induced by network structure in samples of real social networks,
and the effect of restarting the seeds during the sampling
experiment.

II. Materials and Methods 

We describe in this section the models used to generate the
synthetic networks with different number of triangles and
varying levels of community structure, the empirical networks,
the model to simulate the RDS dynamics, the protocols
to artificially distribute the infections in the target
population, and the estimator and other statistics used for
the analysis.


A. Study networks 

A social network is defined by a set of nodes representing
the population and a set of links representing the social contacts,
as for example acquaintances or friendship, between
two individuals. The network structure can be characterized
by different network quantities^"*’^^. The most fundamental
quantity is the degree k that represents the number
of links of a node or equivalently the number of contacts of
an individual. The assortativity of a network measures the
tendency of nodes with similar degree to be connected. The
number of triangles and the clustering coefficient are used
to measure the local clustering in the network. A triangle
corresponds to the situation where two contacts of a node
are also in contact themselves, and the clustering coefficient
is a normalized count of the number of triangles. A network
community, on the other hand, is a group of nodes that are
more connected between themselves than with nodes of other
groups. A fundamental property of the network community
structure is that only a few nodes link (or bridge) different
communities. These nodes are also known as bottlenecks because
they constrain the diffusion, or the sampling process,
in the network. If there are only a few bridging nodes, one
says that the community structure is strong, whereas many
bridging nodes weaken the community structure, reducing
the bottlenecks between groups.
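
As a minimal sketch of the local clustering quantities defined above, the following Python functions count triangles and compute a global clustering coefficient, here taken as three times the number of triangles divided by the number of connected triples. The adjacency-dictionary representation, the function names and this particular normalization are assumptions made for illustration; the text does not state which normalization of the clustering coefficient is used.

```python
from itertools import combinations

def count_triangles(adj):
    """Count distinct triangles in an undirected graph {node: set(neighbours)}."""
    triangles = 0
    for node, neighbours in adj.items():
        # A triangle exists when two contacts of `node` are also in contact.
        for u, v in combinations(neighbours, 2):
            if v in adj[u]:
                triangles += 1
    # Each triangle is found once from each of its three corners.
    return triangles // 3

def global_clustering(adj):
    """3 * triangles / connected triples (0.0 if the graph has no triples)."""
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return 3 * count_triangles(adj) / triples if triples else 0.0

# Toy graph: one triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On the toy graph there is one triangle and five connected triples, so the coefficient is 3/5.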

1. Synthetic Networks 

We use computational algorithms able to generate synthetic
networks with a tunable number of triangles (Fig. 1b)
or level of community structure (Fig. 1c). These algorithms are
not expected to reproduce a particular social network but
to generate various structures observed in social networks
more realistically than previously studied structures®. Our
reference random network is obtained by simply connecting
pairs of nodes for a given degree sequence, a procedure that
results in a negligible number of triangles and no network
community structure (Fig. 1a). This model is also known as
the configuration model^.
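
The reference network can be sketched with a simple stub-matching procedure: every node receives as many half-links as its degree, and half-links are paired uniformly at random. The function below is only an illustration; its name, the adjacency-dictionary output and the choice to drop self-links and repeated links (rather than re-draw them) are assumptions of this sketch, not a description of the implementation used in the study.

```python
import random

def configuration_model(degrees, seed=None):
    """Pair half-links (stubs) uniformly at random for a given degree sequence.

    Self-links and repeated links are simply dropped (an assumption of this
    sketch), so realized degrees can be slightly below the requested ones.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sequence must have an even sum")
    rng = random.Random(seed)
    # One stub per unit of degree, labelled by the node it belongs to.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    adj = {node: set() for node in range(len(degrees))}
    # Consecutive stubs in the shuffled list are joined into a link.
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:            # forbid self-links
            adj[u].add(v)     # sets silently merge repeated links
            adj[v].add(u)
    return adj

adj = configuration_model([3, 3, 2, 2, 2, 2], seed=1)
```

For large sparse networks the fraction of dropped links is small, which is why the sketch tolerates them instead of rewiring.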

The first algorithm, due to Serrano and Boguñá^^, generates
networks with a varying number of triangles and assortativity.
In this algorithm, an a priori degree sequence
is chosen following a given distribution $P(k)$ of node degree
k. We choose a power-law degree distribution with a small
exponential cutoff, i.e. $P(k) \propto k^{-\gamma} \exp(-0.0001k)$. If no
or very small costs are associated with keeping links alive,
scale-free distributions are reasonable models for empirical
distributions; otherwise we usually observe broad-scale distributions
that are not necessarily power-law-like. Generally speaking,
this degree distribution is thus not expected to be the most
appropriate distribution of contacts in real populations but
it captures the right-skewed degree heterogeneity typically
observed in social groups®’. This heterogeneity means
that the majority of nodes have only a few contacts whereas
a small number of them have many contacts. We fix the
minimum possible degree to $k_{\min} = 3$ in order to obtain an
average degree $\langle k \rangle \approx 7$. Furthermore, an a priori clustering
coefficient is chosen such that a given number of triangles
is defined for each degree class k. The algorithm evolves
by randomly selecting three different nodes and forming a
triangle between them, respecting the distribution of triangles
per degree class. As soon as no new triangles can be
formed, the remaining links are uniformly connected (i.e. the
configuration model) such that no links are left unconnected.
Self-links are forbidden. A parameter $\beta$ controls the assortativity
(assortativity increases with decreasing $\beta$) and the
parameters $c_0$ and $\alpha$ control the expected clustering coefficient
(clustering increases with increasing $c_0$ and decreases
with increasing $\alpha$). In this paper, we use $c_0 = 0.5$, $\alpha = 0.3$
and $\beta = 1.0$ (for the configuration with many triangles) and
$c_0 = 0.5$, $\alpha = 1.0$ and $\beta = 1.0$ (for the configuration with
few triangles).

The second algorithm, developed by Lancichinetti and
Fortunato^®, is used to create networks with community
structure. Here one starts by choosing the distribution of
degrees and the distribution of community sizes. In both
cases, we use power-law distributions to capture the heterogeneity
in the degree and in the community size as observed
in some real social networks^^’^®. Other choices of probability
distribution may be more suitable for specific populations
but here again we want to study the heterogeneity
in the community sizes. The degree distribution has the
same parameters as used in the first algorithm, the power-law
distribution of community sizes has exponent -1, and
community sizes are limited between 10 and 1000 nodes.
These values are chosen to guarantee that a sufficient number
of communities are large in size and, at the same time,
enough small-sized communities are represented. For example,
higher values of the exponent would result in relatively
more small-sized communities. These values are also constrained
by the number of links and the level of overlapping
of communities (see below), and are chosen to generate a
network with a single connected component. The number
of overlapping nodes and the number of communities that
each node belongs to are inputs of the algorithm. Overlapping
means that a number of nodes belong to more than
one community (these are the bridging nodes) while the rest
of the nodes only belong to single communities. One may
further select a mixing parameter $\mu$ to add random links
between the bridge nodes and randomly chosen communities
(to weaken the community structure). Therefore, small
overlapping and small mixing generate stronger community
structures. We set $\mu = 0$, and select 100 or 1000 overlapping
nodes in 5 communities respectively for strong and strong-moderate
community structures. For moderate-weak and
weak community structures, we set $\mu = 0.3$ and select
respectively 100 and 1000 overlapping nodes (in 5 communities as
well).

For each algorithm, to obtain the statistics, we generate 
10 versions of the network with the same set of parameters 
and with 10000 nodes each, which is also the size of the 
target or study population. 

2. Empirical Networks 

We also study RDS using real-life networks. We perform
simulations on 5 samples of empirical contact networks
representing different forms of human social relations.
Three data sets correspond to email communication,
two between members of two distinct universities in Europe
(EMA1^®, EMA2^®) and one between employees of a company
(ENR)®®. In these data sets, nodes correspond to people
and social ties are formed between those who have sent
or received at least one email during a given time interval.
One data set corresponds to friendship ties between
US high-school students (ADH)^^. The last data set corresponds
to online communication between members of an
online dating site (POK)^^. Similarly to the email networks,
if two members have exchanged a message through the online
community, a link is made between the respective nodes.
Although some of these data sets do not correspond to social
networks in which RDS would take place, they serve
as realistic settings capturing the network structure of actual
social relations. We have selected data sets with diverse
sample sizes and network structure in order to cover various
contexts and configurations (Table 1).



        EMA1      ADH       EMA2      POK       ENR
N       1,133     2,539     3,186     28,295    36,692
E       5,451     10,455    31,856    115,335   183,831
cc      0.22      0.15      0.26      0.05      0.50
C       57        200       71        2,615     2,441
Cs      2         1         1         1         1
Cl      151       222       1,205     2,621     1,481

Table 1: Summary statistics of the empirical networks used in this
study. Number of nodes (N); number of links (E); clustering coefficient
(cc); number of communities (C); size of the smallest community (Cs)
and size of the largest community (Cl), according to the MapEquation
algorithm^'^.


B. RDS model 

We simulate the sampling by using a stochastic process that
reproduces several features of a realistic RDS dynamics. Our
model further adds a continuous-time framework and a
controllable response-rate. We use parameters similar
to those typically used in the literature®’^^.

We start by uniformly selecting (unless otherwise stated)
10 random nodes as seeds for the recruitment. After a time t,
sampled from an exponential distribution, each seed chooses
uniformly three of its contacts and passes one coupon to each
of them. The exponential distribution is chosen because in
our model we assume that the recruitment follows a Poisson
process. We select the average waiting time to be 5, meaning
that a node waits on average 5 time steps (e.g. 5 days)
before inviting its contacts. Therefore, after waiting t time
steps, and with probability p, which represents the probability
of participation (or response-rate, i.e. one minus the probability
of not returning a coupon), each of these contacts
recruits three of its own contacts that have not participated
yet (sampling without replacement). If a node agrees
to invite its own contacts (i.e. agrees to participate),
we add this node to the sample. The process continues until
all possibilities of new recruitments are exhausted or, at
most, until a specific sample size is reached. Note that
this continuous-time model is equivalent to a discrete-time
model in which randomly chosen nodes update their status
sequentially. We assume that if a node refuses to participate
once, it becomes available for recruitment by other nodes as
if it had never been invited. We repeat the simulation of the RDS
dynamics 50 times for each synthetic network and 500 times
for each empirical network.
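
The recruitment dynamics described above can be sketched as an event-driven simulation in which each participant is placed in a priority queue keyed by the time at which it hands out its coupons. This is only an illustrative sketch: the function name, its arguments and the returned structure are hypothetical, and it assumes that seeds count towards the sample size and that coupons are only passed to contacts who are not yet in the sample at the moment of distribution.

```python
import heapq
import random

def simulate_rds(adj, p, n_seeds=10, coupons=3, mean_wait=5.0,
                 max_sample=None, seed=None):
    """Event-driven RDS sketch on an undirected graph {node: set(neighbours)}.

    Returns {participant: (recruiter, wave)}; seeds have recruiter None, wave 0.
    """
    rng = random.Random(seed)
    max_sample = len(adj) if max_sample is None else max_sample
    seeds = rng.sample(sorted(adj), min(n_seeds, len(adj)))
    sample = {s: (None, 0) for s in seeds}
    # Each queue entry is (time at which the node hands out coupons, node).
    queue = [(rng.expovariate(1.0 / mean_wait), s) for s in seeds]
    heapq.heapify(queue)
    while queue and len(sample) < max_sample:
        now, node = heapq.heappop(queue)
        # Sampling without replacement: only contacts not yet in the sample.
        candidates = [c for c in sorted(adj[node]) if c not in sample]
        for contact in rng.sample(candidates, min(coupons, len(candidates))):
            if len(sample) >= max_sample:
                break
            if rng.random() < p:  # the contact agrees to participate
                sample[contact] = (node, sample[node][1] + 1)
                wait = rng.expovariate(1.0 / mean_wait)
                heapq.heappush(queue, (now + wait, contact))
            # A contact that refuses stays available to other recruiters.
    return sample
```

The returned recruiter and wave of each participant are enough to rebuild the recruitment trees, so tree sizes and numbers of waves per seed can be computed afterwards.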


C. Prevalence of the study variable 

In RDS studies, one is interested in quantifying the prevalence
of some variable A in the target population. This variable
may represent, for instance, testing positive for a
given disease, being male or female, ethnicity, or having
a particular physical trait. In this paper, to simplify the
notation, we say that an individual and its respective node
is infected with A or not infected with A. We use different
protocols to infect a fraction of 25% of the network nodes
with the quantity A. The remaining nodes are thus assumed
to be non-infected.

The reference case (RI) corresponds to uniformly selecting 
the nodes within the target population, i.e. the infection A 
is uniformly distributed in the network. 

The preferential case (PI) corresponds to selecting nodes 
in decreasing order of degree, i.e. we start at nodes with 
the highest degree and infect them with A until 25% of the 
nodes become infected. To add some noise (case PRI), we 
select 20% of the infected nodes, cure them, and redistribute 
these infections uniformly in the network such that the total 
number of infected nodes remains fixed. 

The other two cases consist of infecting nodes according
to the community structure. In the first case (SI), we
infect nodes in the smallest communities until 25% of
the nodes become infected. In the second case (BI), we
infect nodes in the largest communities until the same fraction
of 25% of nodes is infected. To reduce homophily, we add
noise by selecting 40% of the infected nodes, curing them,
and redistributing these infections uniformly in the network
while keeping the total number of infected nodes fixed (these
configurations are named SRI for small and BRI for large
communities).
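
The infection protocols of this section can be sketched as follows. All function names are hypothetical, ties between nodes of equal degree are broken by node label, and the noise step assumes that redistributed infections are drawn among all nodes that are susceptible after the cure (so a just-cured node may be reinfected); the text does not specify these details.

```python
import random

def infect_uniform(nodes, fraction=0.25, seed=None):
    """Protocol RI: infect a uniformly chosen fraction of the nodes."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    return set(rng.sample(nodes, round(fraction * len(nodes))))

def infect_by_degree(adj, fraction=0.25):
    """Protocol PI: infect nodes in decreasing order of degree."""
    ranked = sorted(adj, key=lambda v: (-len(adj[v]), v))  # ties by label (assumption)
    return set(ranked[:round(fraction * len(adj))])

def infect_by_community(communities, fraction=0.25, smallest_first=True):
    """Protocols SI / BI: fill communities in order of size until the target is met."""
    members = set().union(*communities)
    target = round(fraction * len(members))
    infected = set()
    for community in sorted(communities, key=len, reverse=not smallest_first):
        for node in sorted(community):
            if len(infected) == target:
                return infected
            infected.add(node)
    return infected

def add_noise(infected, nodes, noise, seed=None):
    """Cure a share `noise` of the infected nodes and reinfect as many uniformly.

    The total number of infected nodes is unchanged (cases PRI, SRI and BRI).
    """
    rng = random.Random(seed)
    cured = set(rng.sample(sorted(infected), round(noise * len(infected))))
    kept = infected - cured
    susceptible = sorted(set(nodes) - kept)
    return kept | set(rng.sample(susceptible, len(cured)))
```

Under these assumptions, case PRI corresponds to `add_noise(infect_by_degree(adj), adj, 0.2)`, and cases SRI and BRI to `add_noise` with a noise of 0.4 applied to the output of `infect_by_community`.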


D. Statistics 

To analyze the recruitment trees, we measure the total
number of participants D (i.e. the sample-size), and the size
$S_i$ and the number of generations (or waves) $W_i$ of each
recruitment tree, starting from a seed node i.

Figure 2: Statistics of recruitment trees. The panel shows a,f,k,p,u) the total number of recruited subjects Q for different response-rates p (dotted
lines correspond to zero, bars on points correspond to the standard error), the distribution of size of recruitment trees S per seed for response-rates
b,g,l,q,v) p = 0.7 and c,h,m,r,w) p = 1.0 (histogram bin size is 100), and the distribution of number of waves W per seed for response-rates d,i,n,s,x)
p = 0.7 and e,j,o,t,y) p = 1.0 (histogram bin size is 5). The underlying structures are random networks with different number of triangles and different
levels of community structure (See Section II.A).

The proportion of individuals in the population with a
certain feature A ($p_A$) is estimated by using the RDSII estimator^:

$$\hat{p}_A = \frac{\sum_{i \in A \cap N} k_i^{-1}}{\sum_{i \in N} k_i^{-1}} \qquad (1)$$

where $k_i$ is the reported degree of an individual i in the
social network. We thus define:

$$\theta = \frac{1}{m} \sum_{i=1}^{m} \hat{p}_i \qquad (2)$$

as the average estimate of the prevalence of A for m simulations
with the same set of parameters, where $\hat{p}_i$ is the
estimate obtained in simulation i, with standard deviation
given by $\sigma$. Complementarily, we define the average
bias $\delta$, i.e. the difference between the estimate of the prevalence
of A and the true prevalence of A, for m simulations,
as:

$$\delta = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{p}_i - p_A \right| \qquad (3)$$

In the results, we show the bias relative to the
true value of the prevalence, i.e. we show $\Delta = \delta/0.25$. The
design effect^^ D.E. is defined as:


$$\mathrm{D.E.} = \frac{\mathrm{Var}(\hat{p}_A)_{\mathrm{RDS}}}{\mathrm{Var}(\hat{p}_A)_{\mathrm{SRS}}} \qquad (4)$$

where $\mathrm{Var}(\hat{p}_A)_{\mathrm{RDS}}$ is the variance of the estimator $\hat{p}_A$
using RDS, and $\mathrm{Var}(\hat{p}_A)_{\mathrm{SRS}}$ is the variance of the same
estimator using simple random sampling (SRS), i.e. the
same number of nodes (as in the RDS sample) is uniformly
selected in the study population. The design effect thus
measures how many times larger the RDS sample must be to obtain
the same precision as if a simple random sample was used.
In our study, m = 500 (50 RDS simulations for each of the
10 generated networks with fixed parameters, and 500 RDS
simulations for each of the empirical networks).


III. Results

We first discuss the statistics of recruitment trees for synthetic
networks with various levels of clustering and community
structure. We then analyze the performance of the
RDSII estimator for different network structures and for different
scenarios of prevalence of the infection A. This analysis
is followed by results on the convergence of the estimator
for increasing sample size on networks with strong community
structure. Afterwards, we study the RDS performance
considering the same scenarios of prevalence of the infection
using real social networks and conclude the results section
by showing the increased bias that results from running a
single recruitment tree at a time.


A. Recruitment trees

We first look at some statistics of the recruitment trees in
the case that the entire target population can potentially be
recruited, i.e. the recruitment only stops if no new subject
is recruited or if the network is exhausted (everyone is recruited).
Since the population is fixed to 10000 individuals,
this limiting case gives the maximum possible coverage
of the sampling for a given configuration of the RDS. In
the reference case (Fig. 2a-e), only the degree distribution is
fixed and the nodes are uniformly connected (configuration
model, see Section II.A). In this case, if every recruited individual
responds to the survey, i.e. p = 1.0 (see Section II.B),
nearly all the population is recruited. The recruitment dynamics
however is not robust to variations in the response-rate.
For example, in our simulations, for p = 0.7, only about
80% of the population is recruited, and this percentage falls
to negligible values if p < 0.4^®’^®. Successful recruitment in
fact occurs only if p > 0.35 in the absence of any (or with negligible)
triangles and community structure. We observe a broad
distribution in the size of the recruitment trees (Fig. 2b,c).
There is a relatively high chance for the recruitment trees
to break down quickly and thus to contain only a few individuals.
This typically happens when a recruitment tree
reaches a high-degree node. High-degree nodes are easily
reachable because they have many connections. Once the
first recruitment tree passes through one of these high-degree
nodes, they become unavailable. Consequently, the recruitment
trees arriving afterwards simply die out as soon as
they reach these nodes. At the same time, a few recruitment
chains persist long enough and generate large trees,
potentially sampling large parts of the network from a single
initial seed. As expected, there is a characteristic peak
in the number of waves (Fig. 2d,e).

Figure 3: RDS estimates for networks with triangles. The panel shows a,d,g,j,m,p,s,v) the RDS estimator $\theta$ (Eq. 2) and the respective standard deviation
$\sigma$, b,e,h,k,n,q,t,w) the average bias $\Delta$ (Eq. 3), and c,f,i,l,o,r,u,x) the design effect D.E. (Eq. 4) with respect to the response-rate p. In the 1st and
3rd columns, the underlying networks have no triangles and recruitment is limited respectively to 10000 and to 500 participants. In the 2nd and 4th
columns, the networks have a large number of triangles and recruitment is limited respectively to 10000 and to 500 participants (See Section II.A). In
all cases, 25% of the population is infected with A, either following the protocol RI, i.e. infections are uniformly spread (top 3 rows), or protocol PRI,
i.e. infections occur preferentially in high degree nodes (bottom 3 rows) (See Section II.C). Dotted horizontal lines are eye-guides.

The increasing level of clustering has some effect on the
statistics of the recruitment trees. In particular, in the absence
of communities, a large number of triangles improves
recruitment for intermediate values of response-rates (Fig. 2f-j).
Triangles create redundant paths, eliminating bottlenecks
in the network, as for example bottlenecks due to high-degree
nodes. High-degree nodes make a large number of contacts
and thus connect different parts of the network. As
mentioned before, as soon as these nodes are recruited, the
recruitment chain may not be able to expand beyond them.
On the other hand, if the network has weak (Fig. 2k) or
strong (Fig. 2p,u) community structure, the number of triangles
becomes irrelevant, and the level of community structure
defines the sample size. In case of strong community structure
(with a small or large number of triangles), a maximum
of ~ 85% of the population may be recruited (Fig. 2p,u).
Bottlenecks in this case correspond to nodes bridging communities.
These bottlenecks cannot be removed by adding
triangles, which only produce local network redundancy, but
by connecting more nodes between different communities,
i.e. weakening the community structure. Moreover, strong communities
imply that response-rates should be higher (in comparison
to the absence of or to weaker communities) for the
recruitment chains to take off and gather sufficient participants.
If response-rates are below p ~ 0.45, recruitment is
insufficient. This is a fundamental issue in realistic settings,
meaning that highly clustered (or in other words, highly
segregated and marginalized) populations need somewhat higher
compensation in order to achieve the same sample size as
one would obtain when studying less segregated groups.

We see that irrespective of the number of triangles or level
of community structure, lower response-rates cause a relatively
larger number of small recruitment trees together with
few waves (Fig. 2b,d,g,i,l,n,q,s,v,x). This is undesirable
not only because the final sample remains small but also
because a few waves are not sufficient for the stochastic process
to forget the initial conditions and thus reach the stationary
state, the condition in which the estimator is expected
to be unbiased. In case of strong community structure
(Fig. 2s,t,x,y), we note a broader variance in the number
of waves, suggesting that each seed samples the network
non-homogeneously. This may be related to the fact that
the communities have different sizes (or number of nodes)
and thus the bottlenecks between communities are reached
at different times by different recruitment chains.

Figure 4: RDS estimates for networks with weak community structure.
The panel shows a,d,g,j) the RDS estimator $\theta$ (Eq. 2) and the respective
standard deviation $\sigma$, b,e,h,k) the average bias $\Delta$ (Eq. 3), and c,f,i,l) the
design effect D.E. (Eq. 4). The underlying networks have a small number
of triangles and weak community structure (See Section II.A), and
recruitment is limited respectively to 500 (1st column) and to 10000 (2nd
column) participants. In both cases, 25% of the population is infected
with A preferentially towards high degree nodes following either protocol
PI (top 3 rows) or protocol PRI (bottom 3 rows) (See Section II.C).


B. RDS estimates and structure-induced bias 

To study the impact of the network structure and
response-rates on the RDS estimator, we measure four statistics:
(i) the average RDSII estimator $\theta$ (Eq. 2) and its
respective (ii) standard deviation $\sigma$, (iii) the average bias $\Delta$
(see Eq. 3), and (iv) the design effect D.E. (Eq. 4) (See
Section II.D). Figures 3a-f,m-r show the reference case, i.e.
the configuration model where the only structure is in the
degree sequence and the rest is random (See Section II.A).
In this reference case, if there is no restriction on the sample
size with respect to the size of the target population (i.e.
up to 10000 individuals may be sampled, but the actual sample
size depends on the response-rate), the estimator $\theta$ performs
well, although with substantial standard deviation $\sigma$ and biases
$\Delta$ for p < 0.4 even if the quantity A is uniformly spread
in the network (Fig. 3a-c). This is a result of the insufficient
sample size for low response-rates.
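
The four statistics can be sketched in a few lines of Python. The function names and argument layout are hypothetical, and the use of population (rather than sample) standard deviation and variance is an assumption of this sketch; the inputs are the reported degrees and infection status of the respondents of one simulated sample, and the lists of estimates obtained over repeated simulations.

```python
from statistics import mean, pstdev, pvariance

def rds2_estimate(sample_degrees, sample_infected):
    """RDSII estimate (Eq. 1): inverse-degree weighted share of infected respondents."""
    weights = [1.0 / k for k in sample_degrees]
    infected_weight = sum(w for w, a in zip(weights, sample_infected) if a)
    return infected_weight / sum(weights)

def summarize(estimates, true_prevalence=0.25):
    """Average estimate (Eq. 2), its standard deviation, and the relative bias (Eq. 3)."""
    theta = mean(estimates)
    sigma = pstdev(estimates)
    delta = mean(abs(p - true_prevalence) for p in estimates)
    return theta, sigma, delta / true_prevalence

def design_effect(rds_estimates, srs_estimates):
    """Eq. 4: variance of the estimates under RDS over that under uniform sampling."""
    return pvariance(rds_estimates) / pvariance(srs_estimates)
```

For instance, three respondents with degrees 1, 2 and 4, of whom the first and the last are infected, carry weights 1, 0.5 and 0.25, so the estimate is 1.25/1.75 = 5/7 even though two thirds of the raw sample is infected.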

Individuals with a large number of contacts are believed
to be more central in a network^^. These individuals may
be for example more likely to get an infection or propagate a
piece of information. We thus test a hypothetical scenario
where A is concentrated in high-degree nodes (See protocol
PRI in Section II.C). Note that this assumption may
however be completely irrelevant in some contexts, but it is
useful to understand the mechanisms of sampling. In this
case, the accuracy of the estimator $\theta$ is poor for p > 0.3,
i.e. A is under-estimated in both situations, with and without
many triangles, and precision is worse for response-rates
p < 0.35 (respectively Fig. 3j-l and Fig. 3d-f). As before,
the poor accuracy is a result of the RDS not recruiting sufficient
participants. The under-estimation of the prevalence
however suggests that low-degree nodes are not being sufficiently
sampled as the sample size gets close to the network
size. A substantial bias, given by $\Delta$, is also observed. The
design effect varies between 1 and 2, with some exceptions
for p ~ 0.4 in case of many triangles. If the number of participants
is limited to only 500 individuals, i.e. 5% of the
total population (a small fraction of the target population,
as usually recommended to guarantee unbiased estimates^),
the performance of the estimator $\theta$ and the average bias $\Delta$
improve substantially. However, A remains slightly under-estimated
and the standard deviation $\sigma$ increases in the case
of many triangles irrespective of the response-rates (Fig. 3v-x).
The cost of this improvement however is a much higher
design effect (Fig. 3x).

Figure 4a-f shows that in networks with weak community
structure, if A is concentrated at the high-degree nodes, the
estimates remain good for p > 0.2 if the maximum number
of participants is low (up to 500) compared to the total
size of the target population. In the limiting case where
all individuals can potentially participate (up to 10000), A
is slightly overestimated and substantially underestimated
respectively for small and large response-rates, being
accurate only for moderate values, i.e. p ~ 0.4 (Fig. 4g-l). The
results suggest that for larger response-rates, there is a
significant under-representation of low-degree nodes in the final
sample. This happens because low-degree nodes become
increasingly difficult to sample as the sample size gets
close to the network size (causing finite-size effects). Biases
are also larger if the community structure is stronger because
the recruitment chains die out before exploring some of the
communities. Altogether, these results are in accordance
with previous recommendations that the sample size should
be much smaller than the size of the target population^ in
order to achieve good estimates using the RDSII estimator.
Some caution is however warranted since it is not
straightforward to know in advance the size of the target
population and thus to estimate the optimal sample size
with respect to the target population. If too many subjects are
recruited, relative to the size of the target population,
saturation occurs and the network structure induces biases in
the estimator due to finite-size effects.

Figure 5: Prevalence of A in the smallest communities. The panel shows a,d,g,j,m,p,s,v) the RDS estimator θ (Eq. 2) and the respective standard
deviation σ, b,e,h,k,n,q,t,w) the average bias Δ (Eq. 3), and c,f,i,l,o,r,u,x) the design effect D.E. (Eq. 4). The underlying contact networks have
various levels of community structure (See Section II.A), and recruitment is limited to 500 participants. In all cases, 25% of the population is infected
with a quantity A, following either protocol SI (top 3 rows) or protocol SRI (bottom 3 rows) (See Section II.C).


We now simulate scenarios where the variable A is
concentrated in specific communities, irrespective of the degree of
the nodes. This is a reasonable assumption considering that
an infection (or other particular quantities) may affect only
the population of some geographical region, or, for example,
a particular group of injecting drug users among MSM may
be sharing contaminated paraphernalia. By using the known
structure of each network, we select 25% of the nodes
associated with the smallest communities and infect them with the
quantity A (See Section II.C). In this setting, the prevalence
is underestimated and the estimator has relatively large
deviations (Fig. 5a,g) for strong and strong-moderate community
structure. Estimators improve for weaker community
structure (Fig. 5m,s). Even for weak community structure, the
minimum average bias is about 15% (Fig. 5t), being at least
45% in case of strong communities (Fig. 5b) for p = 1.0.
For lower response-rates, the bias gets substantially larger,
as in the previous experiments. The design effect is also
significantly affected for any level of community structure
(Fig. 5c,i,o,u). This means that for strong communities, for
example, in order to have the same statistics as if a standard
simple random sample was performed, the RDS needs up to
40 times the sample size. Furthermore, if we
redistribute the infection of 40% randomly chosen infected nodes
to decrease homophily, the overall quality of the statistics
improves but still with significant bias, and larger standard
deviation and design effect for stronger community structure
(Fig. 5d-f,j-l,p-r,v-x).
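
The two infection steps described above (infecting 25% of the nodes starting from the smallest communities, then moving 40% of the infections to randomly chosen nodes) can be sketched as follows. The exact protocols are defined in Section II.C, which is not reproduced here, so the tie-breaking rules, the function names and the toy partition are assumptions made only for illustration.

```python
import random


def infect_smallest_communities(community_of, fraction=0.25):
    """Infect nodes community by community, smallest first, until
    `fraction` of all nodes carries A (assumed reading of the protocol)."""
    members = {}
    for node, comm in community_of.items():
        members.setdefault(comm, []).append(node)
    target = int(round(fraction * len(community_of)))
    infected = set()
    # Visit communities in order of increasing size (ties broken by label).
    for comm in sorted(members, key=lambda c: (len(members[c]), c)):
        for node in sorted(members[comm]):
            if len(infected) == target:
                return infected
            infected.add(node)
    return infected


def redistribute(infected, all_nodes, share=0.4, rng=random):
    """Move `share` of the infections to uniformly chosen susceptible
    nodes, which lowers homophily while keeping the prevalence fixed."""
    infected = set(infected)
    susceptible = sorted(set(all_nodes) - infected)
    n_move = min(int(round(share * len(infected))), len(susceptible))
    leaving = rng.sample(sorted(infected), n_move)
    arriving = rng.sample(susceptible, n_move)
    infected.difference_update(leaving)
    infected.update(arriving)
    return infected


# Toy partition (hypothetical): communities of 2, 3 and 15 nodes.
community_of = {n: 0 for n in range(0, 2)}
community_of.update({n: 1 for n in range(2, 5)})
community_of.update({n: 2 for n in range(5, 20)})

concentrated = infect_smallest_communities(community_of)
mixed = redistribute(concentrated, community_of, rng=random.Random(1))
print(sorted(concentrated), sorted(mixed))
```

Passing a seeded `random.Random` instance keeps the redistribution reproducible across runs.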

On the other hand, we can assume that A is unlikely to
occur in small communities because, for example, nodes
associated with these communities are simply less likely to get
an infection due to isolation. Social control is also often
higher in small groups. It may therefore be easier to behave
in certain ways in larger groups. People who want to or who
have particular behaviors or traits may thus decide to move
to larger groups. To simulate this hypothetical scenario,
we infect 25% of the nodes in the largest communities (See
Section II.C). Figure 6a shows that A is overestimated for
p > 0.3 for strong community structure. These estimates
improve for weaker communities, also resulting in smaller
standard deviations (Fig. 6g,m,s) for larger response-rates.
The standard deviation is generally slightly larger in this
case in comparison to the case where A is concentrated in
the small communities. The design effect is very high for
strong community structure (Fig. 6c,i), even if homophily is
reduced (Fig. 6f,l).
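
To make the interpretation of the design effect concrete, the sketch below compares the variance of repeated RDS estimates with the variance expected from a simple random sample of the same size. Eq. 4 is not reproduced in this section, so the sketch assumes the common definition (a ratio of variances, ignoring finite-population corrections); the function names and the numbers are hypothetical.

```python
from statistics import pvariance


def design_effect(estimates, true_prevalence, sample_size):
    """Variance of repeated RDS estimates divided by the binomial
    variance of a simple random sample of the same size (assumed form)."""
    srs_variance = true_prevalence * (1.0 - true_prevalence) / sample_size
    return pvariance(estimates) / srs_variance


def effective_sample_size(sample_size, deff):
    """Size of the simple random sample with the same variance."""
    return sample_size / deff


# Hypothetical estimates from four independent RDS realizations.
estimates = [0.20, 0.30, 0.25, 0.25]
deff = design_effect(estimates, true_prevalence=0.25, sample_size=500)
n_eff = effective_sample_size(500, deff)
print(deff, n_eff)
```

A design effect of D thus means that an RDS sample of n respondents carries roughly the information of a simple random sample of n/D respondents, which is why large values translate into much larger required sample sizes.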

We perform the same analysis using networks with the
same configuration studied until now but with higher
clustering coefficient (between 0.5 and 0.6) and the results
remain quantitatively the same (apart from a few fluctuations).

Figure 6: Prevalence of A in the largest communities. The panel shows a,d,g,j,m,p,s,v) the RDS estimator θ (Eq. 2) and the respective standard
deviation σ, b,e,h,k,n,q,t,w) the average bias Δ (Eq. 3), and c,f,i,l,o,r,u,x) the design effect D.E. (Eq. 4). The underlying contact networks have various
levels of community structure (See Section II.A), and recruitment is limited to 500 participants. In all cases, 25% of the population is infected with A,
following either protocol BI (top 3 rows) or protocol BRI (bottom 3 rows) (See Section II.C).

Figure 7: Estimates of prevalence and sample size. The panel shows
the estimator θ and the respective standard deviation σ for networks
with a,b,e,f,i,j) strong and c,d,g,h,k,l) weak community structure. In the
1st column, A is concentrated in the small communities (SRI protocol),
in the 2nd column, A is concentrated in the large communities (BRI
protocol), and in the 3rd column, A is concentrated in the high degree
nodes (PRI protocol).

This finding reinforces the previous observation that
triangles have a relatively small impact in RDS if communities are
present in the network. Altogether, these results show the
key difference between clustering and homophily that was
mentioned in the Introduction. In both scenarios, the
network community structure and the number of triangles are
the same, and homophily is high. In the latter case (Fig. 6)
homophily occurs inside the largest communities whereas in
the first case (Fig. 5) it happens in the smallest communities.
The structure-induced biases however remain relatively high
even if the homophily is reduced by redistributing a fraction
of the infections.


C. Convergence and sample size 

In the previous section, we have studied the bias induced
by the network structure and the response-rates. If we fix the
response-rate, each realization of the simulation generates a
different sample size due to the stochastic nature of the
process. In this section, therefore, we fix the response-rate and
analyze the effect of the sample size on the estimator. Since
recruitment may stop at different times in each simulation,
here we estimate the mean and standard deviation for
sample size S using only simulations in which the recruitment
reaches this size S. This means that the estimates for large
sample sizes have fewer data points (to calculate the mean)
than those for small sample sizes. Previous studies report
that in real settings, response-rates may vary between 0.3
(for female sex-workers) and 0.7 (for MSM), with mean and
median at about 0.5^^. We thus study 3 scenarios for the
response-rates: p = 0.4, 0.5, 0.6. Figure 7 shows that for
strong community structure, θ is slightly overestimated for
sample sizes smaller than 100 and underestimated for larger
sample sizes if A is concentrated in the smallest
communities (Fig. 7a,b). On the other hand, the prevalence of A is
overestimated for sample sizes larger than 100 if A is
concentrated in the largest communities (Fig. 7e,f). In both cases,
the mismatch is maximized when the sample size is between
about 10% and 30% (i.e. 100 and 1500 participants
respectively) of the study population. If A is concentrated in the
high degree nodes, θ is underestimated for increasing sample
size (Fig. 7i,j) but not as much as for the previous cases. On
the other hand, if the community structure is weak, the
estimator performs well (with slight over- and under-estimation
of the prevalence for small (Fig. 7c,d) and large (Fig. 7g,h)
communities), except in the case of A being concentrated in
high degree nodes (Fig. 7k,l). In this case, the estimates are
only good in the range of sample sizes between 100 and 1000
nodes.

Figure 8: RDS estimates with seeds selected inside small communities.
The panel shows the RDS estimator θ (Eq. 2) and the respective
standard deviation σ, the average bias Δ (Eq. 3), and the design effect D.E.
(Eq. 4). In all cases, 25% of the population is infected with A in a-c,g-i)
the smallest communities and in d-f,j-l) the largest communities (See
Section II.C). a-f) networks with strong communities and g-l) networks
with weak communities.

Figure 9: RDS estimates with seeds selected inside large communities.
The panel shows the RDS estimator θ (Eq. 2) and the respective
standard deviation σ, the average bias Δ (Eq. 3), and the design effect D.E.
(Eq. 4). In all cases, 25% of the population is infected with A in a-c,g-i)
the smallest communities and in d-f,j-l) the largest communities (See
Section II.C). a-f) networks with strong communities and g-l) networks
with weak communities.
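
The conditional averaging used in the sample-size analysis of this section (statistics at size S computed only over the realizations that actually reached S) can be sketched as below. The data layout, in which each run stores the running estimate after each recruited respondent, and the name `estimate_at_size` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def estimate_at_size(runs, size):
    """Mean and standard deviation of the running estimate at sample
    size `size`, using only the runs whose recruitment reached it.

    runs -- list of runs; run[s - 1] is the estimate computed from the
            first s respondents of that realization
    Returns (mean, std, number of contributing runs) or None.
    """
    values = [run[size - 1] for run in runs if len(run) >= size]
    if not values:
        return None  # no realization reached this sample size
    return mean(values), pstdev(values), len(values)


# Three hypothetical realizations that stopped at sizes 3, 2 and 1.
runs = [[0.0, 0.5, 0.4], [1.0, 0.5], [0.0]]
for size in (1, 2, 3, 4):
    print(size, estimate_at_size(runs, size))
```

The third element of the returned tuple makes explicit that fewer realizations contribute as the sample size grows.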


D. Seed-induced bias 

We have assumed so far that seeds are uniformly chosen
within the target population. While this is a reasonable
standard assumption in theoretical studies, it is hardly met
in real contexts because of the inherent fact that the study
population is hard-to-reach and seed selection is
non-trivial^°. A biased selection of seeds can increase the bias in
the RDS estimators, as shown in Figs. 8 and 9. If seeds
are selected only among subjects associated with small
communities (here defined as communities with fewer than 200
members), recruitment chains are generally unable to reach
beyond those communities and thus the prevalence is
overestimated when the infection is concentrated in the smaller
communities (Fig. 8a-c). On the other hand, the
prevalence is underestimated if the infection is concentrated in the
larger communities (Fig. 8d-f). The mismatch in the
estimators is particularly significant if the community structure
is stronger; however, the prevalence is also strongly biased
for low response-rates (and weakly biased for high
response-rates) even if the community structure is weak (Fig. 8g-l).
This is in contrast to our previous findings when seeds
are uniformly sampled (Figs. 5 and 6).

If one selects the seeds in the largest communities (here
defined as communities with more than 500 members),
recruitment chains tend to stay within the largest
communities, which leads to an under-estimation of the prevalence
and relatively high biases if the infection is mostly
prevalent in the small communities (Fig. 9a-c). The prevalence
is overestimated, however, if the infection is mostly
prevalent in the largest communities (Fig. 9d-f). Results improve
for weak community structure, but also in this case, biases
and large standard deviations are observed for low
response-rates (Fig. 9g-l). Note that in these experiments homophily
is relatively weak since we use protocols SRI and BRI.

Figure 10: RDS estimates for empirical networks. The panel shows the RDS estimator θ (Eq. 2) and the respective standard deviation σ, the average
bias Δ (Eq. 3), and the design effect D.E. (Eq. 4). The contact networks are gathered empirically and correspond to different types of social relation
and population size (See Section II.A). Recruitment is limited to 500 participants and the response-rate p covers realistic values. In all cases, 25% of
the population is infected with a quantity A, either in the largest communities (1st column), or smallest communities (2nd column), or the high degree
nodes (3rd column) (See Section II.C).


E. Empirical networks 

In the previous sections, we have studied the impact of
various levels of community structure and number of
triangles on RDS estimates in contact networks generated using
theoretical models. Although the algorithms used to
generate the synthetic networks include several properties of
real-life networks, empirical networks, with their own
sampling and scope limitations, contain correlations that may
be challenging to reproduce theoretically. In this section, we
analyze the RDS performance using real-life human contact
networks in order to be able to extend the conclusions to real
scenarios. Following the same protocols to infect
preferentially the largest (BRI protocol) or the smallest (SRI
protocol) communities, or the high degree nodes (PRI protocol),
we find that in most studied networks, RDS performs well in
estimating the mean prevalence in these hypothetical scenarios
(although the standard deviations are relatively large), with
a small variation for different response-rates (Fig. 10). The
estimates are worse for the EMA1 and ENR datasets,
respectively the smallest and the largest networks (See Table 1 in
Section II.A). We see that the average bias is larger than
10% with a few exceptions. It is also typically larger for
p = 0.4. The design effect is generally somewhere between 1
and 3 (one exception for p = 0.4 and EMA2), a result in line
with previous suggestions that a design effect of 2 may be used
as a general guideline for unknown populations®.

Figure 11: RDS estimates with improved seed sampling mechanism.
The panel shows the RDS estimator θ (Eq. 2) and the respective
standard deviation σ, the average bias Δ (Eq. 3), and the design effect D.E.
(Eq. 4). Recruitment is limited to 500 participants. In all cases, 25%
of the population is infected with A in a-c,g-i) the largest communities
and in d-f,j-l) the smallest communities (See Section II.C). a-f) networks
with strong communities and g-l) networks with weak communities.


F. Non-simultaneous seed sampling 


Recruitment trees often break down after a few waves in
real settings. As discussed above, this is not only a
consequence of low response-rates but also the effect of multiple
recruitment trees, originating from different seeds,
bumping into each other and then dying out. A practical solution to
obtain sufficiently large sample sizes is to restart the
recruitment with new seeds as soon as the recruitment stops. In
this section we test the effect on the estimators if we start
a new seed after the previous recruitment chain has
completely stopped, i.e. seeds start at different points in time.
We assume that the recruitment chain stops either
naturally or after reaching a certain size. In particular, we test
the case when the target sample size is 500 participants out
of the population of 10000 people, using 10 seeds as done in
the previous sections. Each seed is allowed to recruit
(successfully) at maximum 50 participants, and a new seed is
only selected (uniformly among non-recruited nodes) when
the current recruitment stops. As usual, the same person
may participate only once.
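
The restart procedure described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the number of coupons per participant (3), the rule that the cap of 50 counts the seed itself, the breadth-first order of recruitment, and all function and parameter names are choices made here and are not confirmed by this section.

```python
import random
from collections import deque


def sequential_seed_rds(adjacency, response_rate, target_size=500,
                        per_seed_cap=50, max_seeds=10, coupons=3,
                        rng=random):
    """Recruit with one active seed at a time; a new seed is drawn
    uniformly among non-recruited nodes when the current tree stops.

    adjacency -- dict mapping each node to a list of its contacts
    Returns (recruited nodes in order, number of seeds used).
    """
    recruited, in_sample, seeds_used = [], set(), 0
    while len(recruited) < target_size and seeds_used < max_seeds:
        candidates = [v for v in adjacency if v not in in_sample]
        if not candidates:
            break
        seed = rng.choice(candidates)
        seeds_used += 1
        in_sample.add(seed)
        recruited.append(seed)
        tree_size, queue = 1, deque([seed])
        # The tree stops naturally (empty queue) or at the size cap.
        while (queue and tree_size < per_seed_cap
               and len(recruited) < target_size):
            recruiter = queue.popleft()
            # Each person participates once: skip sampled contacts.
            peers = [u for u in adjacency[recruiter] if u not in in_sample]
            rng.shuffle(peers)
            for peer in peers[:coupons]:
                if (tree_size >= per_seed_cap
                        or len(recruited) >= target_size):
                    break
                if rng.random() < response_rate:  # invitation accepted
                    in_sample.add(peer)
                    recruited.append(peer)
                    queue.append(peer)
                    tree_size += 1
    return recruited, seeds_used


# Toy contact network (hypothetical): 30 people who all know each other.
toy = {v: [u for u in range(30) if u != v] for v in range(30)}
sample, n_seeds = sequential_seed_rds(
    toy, response_rate=1.0, target_size=20, per_seed_cap=5,
    rng=random.Random(7))
print(len(sample), n_seeds)
```

With a response-rate of zero the function returns only the seeds, which gives a quick sanity check of the restart logic.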

Figure 11 shows that the average estimator is affected
and the prevalence is under-estimated if A is concentrated
in the largest communities and over-estimated if A is more
likely in smaller communities. The standard deviations are
relatively large in case of strong communities (Fig. 11a,d)
and decrease, while still maintaining relatively large
values, for weaker communities (Fig. 11g,j). The average bias
and design effect substantially change in comparison to the
case when seeds are selected simultaneously, particularly for
strong community structure. The restarting of seeds
introduces mixing, or equivalently, random links, in the network
structure^®. Moreover, the restriction on the size of the
recruitment trees possibly prevents the sampling process from
reaching the stationary state, a factor known to cause biases
in the estimators®. The major consequence of these very
large biases is that one cannot be sure that a single RDS
experiment (as is usually the case in reality) provides a reliable
estimate. Selecting exactly one new seed after the current
seed has been exhausted is a hypothetical situation. This
extreme case however illustrates that the non-simultaneous
selection of seeds may increase the biases substantially unless
only a few re-starts occur. We expect that more realistic
scenarios (e.g. initially selecting multiple seeds simultaneously
and eventually selecting a few new seeds if the original
recruitment trees die out) lie somewhere between this case and
the simultaneous seed sampling studied in Section III.B.

IV. Discussions 

Respondent-driven sampling has been proposed as an
effective methodology to estimate the prevalence of variables
of interest in hard-to-reach populations. The approach
exploits information on the social contacts for both
recruitment and weighting in order to generate accurate estimates
of the prevalence. Social networks however are not random
but contain patterns of connectivity that may constrain the
cascade of sampling. In particular, nodes have a high
heterogeneity in the number of contacts, and networks typically
have many triangles and a community structure.

In this paper, we have studied the bias induced by
community structure and network triangles in the RDS by using
both synthetic and empirical network structures with various
levels of clustering, size, degree heterogeneity, and so on. We
have also analyzed the impact of various response-rates on
the estimators and quantified the relative bias for
combinations of parameters. Altogether, we have identified that the
structure of social networks has a relevant impact on RDS,
leading to potential biases in the RDS estimator. The
estimator generally performs sufficiently well if response-rates
are sufficiently high, the community structure is weak and
the prevalence of the variable of interest is not strongly
concentrated in some parts of the network (low homophily). The
high heterogeneity of the network communities implies that
sampling chains may get constrained to certain parts of the
network and thus the prevalence of the infection may be
either under- or over-estimated depending on which part of
the network concentrates more infections. Some parts of the
network may only be accessed through tight bottlenecks, i.e.
key individuals that bridge the small well-hidden sub-groups
and the rest of the population. If these bridging nodes are
not willing to participate in the recruitment, or once they
have been recruited (and thus cannot be recruited again),
recruitment trees get trapped within a group of nodes,
oversampling them and generating biases.

The structure of empirical networks may vary in different
contexts. Consequently, the expected biases may also be
lower or higher for certain social networks. In
particular, biases should increase for sparser networks because
fewer paths are available between the nodes. In other words,
there are more bridging nodes keeping the network
connected and thus the recruitment becomes more sensitive to
lower response-rates. Similarly, lower biases are expected
in denser networks. The number of network communities
and the distribution of community sizes may also be
different from the ones we consider. Many small communities
have a significant effect on the sampling, increasing the
biases, because they imply the existence of many bridging
nodes and higher chances to divert or break down the
recruitment. We have also assumed that those people who
choose not to participate at the first invitation may be
invited again. This possibly introduces a positive correlation
between the chance to answer the survey and the degree of the
node, i.e. a tendency to oversample high degree nodes not
related to clustering or community structure. Since this is
not possible for response-rates p ≈ 1 and we generally
observe relatively similar results for decreasing p, if this effect
occurs, it is only relevant for low response-rates p. This may
however explain why we generally observe a transition in the
average bias at values above the critical response-rates (the
point where a significant number of individuals is recruited).
On the other hand, the effect of clustering and communities
should be even higher if we assume that people cannot be
invited more than once (or equivalently, if someone who refuses to
participate the first time also refuses the following times)
since this further blocks the access to certain
parts of the network.

To understand the effect of the participation probability p,
we may consider the simple case where one single coupon
is exchanged between individuals, and the sampling is done
with replacement^. In that case, the stochastic process is
equivalent to a random walk process if p = 1. The
probability P_i of finding a coupon with person i is driven by the rate
equation

$$\frac{dP_i}{dt} = \sum_j \frac{A_{ij}}{k_j} P_j - P_i \qquad (5)$$

where A_ij is the adjacency matrix of the social network and
k_j = \sum_i A_ij is the degree of node j.
In the case of undirected and unweighted networks, where
each link is reciprocated and carries the same importance,
the element (i,j) of the matrix is equal to 1 if there
is a link between i and j and zero otherwise. The study of
this stochastic process has a long tradition in applied
mathematics and statistical physics (e.g.^^’"*^). Relevant to our
results, it is known that the system converges to equilibrium if
the underlying network is connected^. In this regime, nodes
would be visited by coupons with a probability proportional
to their degree and the whole network is explored,
independently of the initial conditions. Equilibrium is reached after
a characteristic time scale τ defined as 1/λ_2, where λ_2 is the
first non-zero eigenvalue of the Laplacian matrix driving P in
Eq. (5). This time scale is associated with the presence of a
bottleneck (the bridging nodes) between two strongly connected
communities in the network. For times smaller than τ, the
random walk has essentially explored almost uniformly one
single community, but has not sufficiently explored the other
one. This time scale therefore provides us with a way to
estimate the minimal value of p needed for the whole graph
to be sampled, that is 1 - p < λ_2. The case of sampling
with restart is related to the process of random walk with
teleportation. In that case, the choice of the seed where to
restart the process is known to affect the statistical
properties of the sampling of the network^®. A future theoretical
exercise is to adapt those ideas to this context in order to
improve the RDS estimators in situations where restarting
is necessary. Furthermore, using non-backtracking random
walks may be a possible theoretical direction to model RDS
considering sampling without replacement. Those random
walks avoid returning to the node visited at the
previous step, and they are known to explore the network
faster^^.
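
The convergence argument above can be checked numerically on a toy network. The sketch below integrates a random-walk rate equation with an explicit Euler scheme, assuming the standard form in which a coupon held by person j moves to each of the k_j contacts at rate 1/k_j (the case p = 1, sampling with replacement). The two-triangle network, the step size and the function name are hypothetical choices made for illustration.

```python
def integrate_rate_equation(adjacency, start, dt=0.05, steps=20000):
    """Euler integration of dP_i/dt = sum_j A_ij P_j / k_j - P_i,
    with the coupon initially held by node `start`."""
    nodes = list(adjacency)
    degree = {v: len(adjacency[v]) for v in nodes}
    prob = {v: 0.0 for v in nodes}
    prob[start] = 1.0
    for _ in range(steps):
        # Probability flowing into v from each neighbour u, which
        # passes the coupon to a uniformly chosen contact.
        inflow = {v: sum(prob[u] / degree[u] for u in adjacency[v])
                  for v in nodes}
        prob = {v: prob[v] + dt * (inflow[v] - prob[v]) for v in nodes}
    return prob


# Two triangles joined by a single bridge (nodes 2 and 3).
bridge_net = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
total_degree = sum(len(nbrs) for nbrs in bridge_net.values())

early = integrate_rate_equation(bridge_net, start=0, steps=20)  # t = 1
late = integrate_rate_equation(bridge_net, start=0)             # t = 1000
mass_in_start_community = early[0] + early[1] + early[2]
for v in bridge_net:
    print(v, round(late[v], 4), round(len(bridge_net[v]) / total_degree, 4))
```

At short times most of the probability remains in the community of the starting node, while at long times each node is visited with probability proportional to its degree, in line with the behaviour described in the text.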

Finally, the results of our numerical exercise suggest some
general recommendations for studies in real settings: i.
Experimental researchers should be aware of the potential
critical bridge nodes in the study population, which may vary
according to the characteristics of the population; ii.
Experimental researchers should aim for response-rates at least
above 0.4 in order to reduce the associated biases and
uncertainty of the estimates. This recommended response-rate
may be increased if more coupons are used; iii. Care
should be taken to select the seeds as uniformly as
possible, particularly aiming to avoid many seeds either in the
small or in the large groups (typically the most reachable
individuals). The temptation to start all seeds within
well-hidden groups may cause the recruitment to not move
beyond these groups; iv. Restarting the seeds (to get larger
sample sizes) during the ongoing recruitment should be
generally avoided. A better strategy may be to either start the
experiment with more seeds or to increase response-rates to
avoid dropouts.